meaning to the concepts they are searching. A semantic
space is created by a lexicon of concepts and relations
between concepts. A query is mapped to a first meaning
differentiator in the semantic space. Similarly, each data element in the target
data set being searched is mapped to a second meaning
differentiator in the semantic space. Searching is accomplished by determining
a semantic distance between the first and second
meaning differentiators, wherein this distance represents their
closeness in meaning. Search results for the input query are
presented such that the target data elements that are closest in
meaning, based on their determined semantic distance, are
ranked higher.
MEANING-BASED INFORMATION ORGANIZATION AND RETRIEVAL

FIG. 5 is a system diagram of one embodiment of the invention.

This application claims the benefit of Provisional Application
Ser. No. 60/155,667, filed Sep. 22, 1999.

FIELD OF THE INVENTION 

The invention relates generally to information organization
and retrieval. More specifically, the invention relates to
search engine technology.

BACKGROUND 

The Internet, which is a global network of interconnected
networks and computers, commonly makes available a wide
variety of information through a vehicle known as the World
Wide Web (WWW). Currently, hundreds of millions of "web
sites" that house and format such information in documents
called web pages are available to users of the Internet. Since
the content of such pages is ungoverned, unregulated and
largely unorganized between one site and the next, finding
certain desired information is made difficult.

To aid users in finding sites or pages having information
they desire, search engines were developed. Search engines
and directories attempt to index pages and/or sites so that
users can find particular information. Typically, search
engines are initiated by prompting a user to type in one or
more keywords of their choosing along with connectors
(such as "and") and delimiters. The search engine matches
the keywords with documents or categories in an index that
contain those keywords or are indexed by those keywords
and returns results (either categories or documents or both)
to the user in the form of URLs (Uniform Resource
Locators). One predominant web search engine receives
submissions of sites and manually assigns them to categories
within its directory. When the user types in a keyword, a
literal sub-string match of that keyword with either the
description of the site in the index or the name of the
category occurs. The results of this sub-string search will
contain some sites of interest but, in addition, may contain
many sites that are not relevant or on point. Though one may
refine the search with yet more keywords, the same sub-string
match will be employed, but on the result set just
obtained. Almost all search engines attempt to index sites
and documents and leave it to the user to formulate an
appropriate query, and then to eliminate undesired search
results themselves. Recently, other search engines using
natural language queries have been developed, but these also
often result in many undesired responses.

The quality of the results obtained varies, but by doing
essentially sub-string matches or category browsing, the
engines are unable to properly discern what the user actually
intends or means when a particular keyword is entered.

BRIEF DESCRIPTION OF THE DRAWINGS

The objects, features and advantages of the method and 
apparatus for the present invention will be apparent from the 
following description in which: 

FIG. 1 is a flow diagram of one or more embodiments of
the invention.

FIG. 2 illustrates a portion of a relationship-based lexicon
employed in one or more embodiments of the invention.

FIG. 3 illustrates the concept of bond strength and semantic
distance in one or more embodiments of the invention.

FIG. 4 illustrates the application of synsets to categories 
in a subject directory tree. 



DETAILED DESCRIPTION 

Referring to the figures, exemplary embodiments of the 
invention will now be described. The exemplary embodi- 
ments are provided to illustrate aspects of the invention and 
should not be construed as limiting the scope of the inven- 
tion. The exemplary embodiments are primarily described 
with reference to block diagrams or flowcharts. As to the 
flowcharts, each block within the flowcharts represents both 
a method step and an apparatus element for performing the 
method step. Depending upon the implementation, the cor- 
responding apparatus element may be configured in 
hardware, software, firmware or combinations thereof. 

The searching paradigm presented herein relies on an 
unconventional approach to information retrieval; namely, 
the idea of a "meaning-based" search. Instead of simply 
indexing words that appear in target documents, and allow- 
ing users to find desired word instances within documents or 
an index, searches are instead conducted within the realm of 
"semantic space", allowing users to locate information that 
is "close in meaning" to the concepts they are interested in. 

A search engine and searching paradigm so implemented
enables Web users to easily locate subject categories within
a large subject directory, such as Netscape's Open Directory
(a product of Netscape Communications Corporation), in a
convenient and meaningful manner. A "target document"
refers to a single subject page within such a directory. Such
a subject directory is arranged in a roughly hierarchical
fashion and consists of many unique topics. By allowing
users to refine their searches to specific meanings of words,
the invention in its various embodiments enables users to
quickly filter out undesired responses, and therefore achieve
more precise and more relevant search results. For example,
a user would be able to filter out results relating to the
concept of "Bulls" as a basketball team when they are
only interested in the concept of "Bulls" as a kind of cattle.
Because searches conducted using search engines implemented
according to the invention result in presenting
conceptual areas "near" to a particular meaning, the user is
also presented with categories that are likely to be of
interest, yet might have been missed by a traditional search
approach. An example would be a result of "Cows" to a
search on "Bulls", which would come up as a result because
the concepts are deemed "near" to each other in semantic
space.

FIG. 1 is a flow diagram of one or more embodiments of 
the invention. 

The flow diagram in FIG. 1 for implementing a meaning-based
information organization and retrieval system may be
summarized as follows:
Pre-Processing/Organization 

1. Define a "semantic space" by creating an intercon- 
nected lexicon of meanings. 

2. Determine the "semantic distance" between meanings 
in semantic space that describes how close, 
conceptually, one meaning is to another. 

3. Designate a "location" in semantic space for each target 
document. 

4. Pre-calculate "scores" for each "target document" for
each relevant input meaning, based on nearness in
semantic space obtained by measuring semantic distance.
Retrieval

1. Implement a search engine that converts an input string
into a set of probable desired meanings, then locates
targets based mainly on pre-calculated scores.

2. Create a user-interface to the engine that allows users
to easily make "initial" and "refined" searches, and that
presents results to users in a logical and orderly fashion.

Referring to FIG. 1
Block 110

In the pre-processing or organization stage, the first step
is to develop/update a useful meaning-based lexicon (block
110). A "lexicon" is a network of interconnected meanings;
it describes the "semantic space" that is employed by the
search engine. One such lexicon already in existence consists
of thousands of meanings, or "synsets", which are
connected to one another through two key relationships:
"kind of" and "part of". For example, the concept of "table"
is connected to the concept of "furniture" through a "kind
of" connection. Thus, "table" is a kind of "furniture".
Similarly, "California" is a part of "United States".

From this basis, the lexicon can be updated and expanded
to include new meanings or update connections for meanings
already present. New meanings can be added mainly
to reflect the importance in everyday culture of a large
number of proper nouns, which are not available in many
lexicons, such as the names of companies, people, towns, etc.
In addition, new meanings may have to be developed within
the lexicon in order to cover subject areas of common
interest with greater specificity. For example, whereas an
existing lexicon might define the idea of a "programming
language", it may be useful to expand this concept within the
lexicon to designate the many hundreds of specific programming
languages that may exist as "kind of" children to this
meaning.

Additionally, an innovative type of relationship between
meanings has been developed as a corollary to the invention,
in order to convey information that is not treated in lexicons.
This relationship, called a "bind", describes one meaning's
closeness to another meaning in people's common understanding.
For example, "skier" and "skiing" are not closely
related concepts in existing lexicons. The former is a kind of
"athlete", ultimately a kind of "human being"; and thus
would reside within the "entity" or "living thing" tree. The
latter is a kind of "sport", ultimately a kind of "activity"; it
is in the "actions" tree. Though the subjects are closely
related in everyday usage, they may be in widely separated
locations within the lexicon's network of meanings. To
remedy this, a "bind" has been made between the two
meanings, to reflect their close proximity in semantic space
(when you think of one concept, you tend to think of the
other). This new type of bonding between meanings is
essential for creating a "semantic space" that will yield
useful search results.

An extension to this "bind" connection is the concept of
varying "bond strengths" between meanings. A value can be
assigned to a connection from one meaning to another that
signifies how strongly the second meaning relates to the first.
These connection strengths are dependent on the direction of
the bond, so that, for example, "skier" might imply a strong
connection to "skiing", whereas "skiing" need not imply
"skier" to the same degree.

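The lexicon structure described so far can be pictured as a small directed graph. The following is a minimal sketch, not the patented implementation: the class names, method names and numeric bond strengths are illustrative assumptions, and only the "kind of", "part of" and directional "bind" relationships come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Meaning:
    """One node ("synset") in the hypothetical lexicon graph."""
    name: str
    kind_of: set = field(default_factory=set)   # names of "kind of" parents
    part_of: set = field(default_factory=set)   # names of "part of" parents
    binds: dict = field(default_factory=dict)   # target name -> bond strength


class Lexicon:
    """Network of interconnected meanings (an illustrative structure only)."""

    def __init__(self):
        self.meanings = {}

    def add(self, name):
        # Create the meaning on first use, otherwise return the existing one.
        return self.meanings.setdefault(name, Meaning(name))

    def add_kind_of(self, child, parent):
        self.add(parent)
        self.add(child).kind_of.add(parent)

    def add_part_of(self, part, whole):
        self.add(whole)
        self.add(part).part_of.add(whole)

    def add_bind(self, source, target, strength):
        # Bonds are directional: source -> target carries its own strength.
        self.add(target)
        self.add(source).binds[target] = strength

    def bond_strength(self, source, target):
        # 0.0 means no "bind" has been set in this direction.
        return self.meanings[source].binds.get(target, 0.0)


lexicon = Lexicon()
lexicon.add_kind_of("table", "furniture")
lexicon.add_part_of("California", "United States")
# Assumed strengths: "skier" implies "skiing" more strongly than the reverse.
lexicon.add_bind("skier", "skiing", 0.9)
lexicon.add_bind("skiing", "skier", 0.5)
```

Storing the strength on the source meaning keeps the two directions of a bond independent, which matches the "skier"/"skiing" asymmetry described above.
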
One other enhancement is the addition of a "commonness"
value to each meaning, which reflects a combination
of "how often does this concept arise in everyday usage?" and
"how specific a concept is this?" This value allows both
more accurate interpretation of input terms passed to the
search engine ("what is more likely to be meant by this?")
and improved ranking of search results, showing users
the results that they are more likely to be interested in first
(the more specific terms are likely to be more important to
the user).

Meanings within the lexicon may also be flagged in 
several new ways: 

Meanings that signify geographical places are marked as 
"locations". This allows special calculations to come 
into play in the search engine that are unique to these 
meanings. 

Meanings that are "offensive" are marked, indicating that 
the general public may find these meanings to be 
distasteful or vulgar in some way. This flag allows us to 
give users the option to filter search results that they 
may find offensive. 

Meanings that signify specific "instances" of things are
marked. For example, "computer company" is a
generic kind of concept, but "Microsoft" describes a
unique entity. Knowing when "kind of" children of a
meaning are "instances" allows the semantic distance
calculations to more accurately estimate the precision
of a parent meaning (this is described in more detail
later).

The "currentness" of meanings may be noted. Meanings 
marked as "current" are those that are in some way 
timely; values surrounding them are more likely to vary 
over time than other concepts. For example, the word 
"Monica" might currently imply the meaning "Monica 
Lewinsky" to a greater degree today than it will a year 
from now. By marking meanings that are uniquely 
likely to change within the near future, the Lexicon 
may be more easily kept up to date. 
Meanings may be marked as "pseudosynsets". This term 
describes meanings that are either not in common usage 
because they are highly specific or technical, or that 
exist within the Lexicon purely for the purpose of 
forming a parent category for a group of child mean- 
ings. An example of the former might be the Latin 
terms that describe phylum, class, or species within the 
biological taxonomy. An example of the latter would be 
the concept of "field sports", which exists mainly for 
the purpose of grouping similar specific sports together 
cleanly in the Lexicon, rather than because it in itself is 
actually an oft used meaning in common usage. By 
marking "pseudosynsets", more accurate values for 
semantic distance may be calculated (this is described 
in more detail later). 
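The flags listed above could be carried on each meaning as a small bit set. This is a sketch under assumptions: the enum, the record type, the commonness scale and the two helper functions are hypothetical names chosen for illustration; only the five flag categories and the "commonness" value come from the text.

```python
from dataclasses import dataclass
from enum import Flag, auto


class MeaningFlag(Flag):
    """Flags a meaning may carry (names mirror the list above)."""
    NONE = 0
    LOCATION = auto()
    OFFENSIVE = auto()
    INSTANCE = auto()
    CURRENT = auto()
    PSEUDOSYNSET = auto()


@dataclass(frozen=True)
class FlaggedMeaning:
    name: str
    flags: MeaningFlag = MeaningFlag.NONE
    commonness: float = 0.5  # assumed 0..1 scale; the text gives no range


def filter_offensive(meanings):
    """Drop meanings marked offensive, for users who opt in to filtering."""
    return [m for m in meanings if not (m.flags & MeaningFlag.OFFENSIVE)]


def is_exact_origin(meaning):
    """True for instance or pseudosynset meanings, which the description
    treats as more exact when used as the origin of a search."""
    exact = MeaningFlag.INSTANCE | MeaningFlag.PSEUDOSYNSET
    return bool(meaning.flags & exact)


microsoft = FlaggedMeaning("Microsoft", MeaningFlag.INSTANCE)
australia = FlaggedMeaning("Australia", MeaningFlag.LOCATION | MeaningFlag.INSTANCE)
field_sports = FlaggedMeaning("field sports", MeaningFlag.PSEUDOSYNSET, 0.1)
```

A combined flag value lets one meaning be, for example, both a location and an instance without extra fields.
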
Block 115 

Each subject, or node, within the directory is given a 
"location" within the semantic space that is established by 
the Lexicon. Before a search engine can find anything within 
this space, targets to retrieve must be placed there. The 
process is envisioned to be a manual one; editors examine 
each directory subject node, and decide upon the "meaning" 
of the subject. However, this process may also be automated. 

To specify a node's location in semantic space, one or 
more individual meanings from the Lexicon are "attached" 
to the node. These meanings may be grouped together into 
"synset groups", each of which, in a sense, describes the 
node's position within a different dimension of semantic 
space. 

For example, a node within the directory that is about 
"Piano Players" would be assigned the meaning piano 
player. A node about "Piano Players in Australia" would be 
assigned two meanings, piano player and Australia. The two 
concepts represent two distinct "synset groups" on the node, 
as each establishes a location for the node within a different
"dimension" of meaning. Another way to look at this
example is to say that the node represents the "intersection"
of the concepts of piano player and Australia; it is where
the two ideas come together within the directory.

Extending this example, consider a node that is about
"Piano Players in Australia and New Zealand". In this case
the meanings Australia and New Zealand might both be
placed on the node, but grouped together in one "synset
group", because the combination of the two describes the
node's location within the "geographical location" dimension
of meaning. Another way to look at this example would
be to say that the node is about the intersection of the
concept of piano player with the union of the concepts
Australia and New Zealand. The purpose of this synset
grouping is solely to provide more accurate positioning of
the node within semantic space, and a more accurate reflection
of the specificity of the node, both of which result in
improved search engine retrieval.

The lexicon is updated and further developed primarily
when ascribing nodes to a location in semantic space is
impossible because the needed meaning does not exist in the
semantic space. Thus, blocks 110 and 115 can be thought of
as implemented in complementary fashion to one another.
Block 120

If the semantic space laid out by the lexicon developed as
described with respect to block 110 is to be effectively used,
the concept of "distance" from one meaning to another
within this space must be defined. The input to this portion
of the process is the lexicon itself and the output is a table
of information that details the distance from each meaning
to each other meaning that falls within a certain "radius of
semantic closeness".

The closeness of meanings is affected to a large degree by
their perceived "precision". For example, we can guess at
how close the concepts of "sports" and "baseball" are based
on the fact that there are many other particular kinds of
sports under "sports" than baseball. As baseball appears to
be one of many, its connection to the concept of "sports" is
not as strong as if, say, there were only two sports in the
world, and baseball was thus one of only two possibilities
for what is meant by "sports". This idea is reflected in an
algorithm that estimates the "kind of" and "part of" precision
of a meaning based on the total count of its descendants,
following "kind of" and "part of" relationships. In these
calculations, meanings marked as "instances" are biased
against, as they would tend to incorrectly dilute the precision
of a concept otherwise.

Differences in estimates of precision are used to generate
a semantic distance between two directly connected meanings
only when a connection strength has not been set.
Manual settings override the calculated estimates; thus the
semantic distance results come about from a combination of
automatically estimated connection strengths, and strengths
that have been manually set.

The process for discovering meanings that are semantically
close to a specific meaning involves a traditional
breadth-first search outward from the origin meaning.
Neighboring meanings in the network of nodes are explored
in an outward seeking fashion, and distance from the origin
is tracked. When a certain radius has been reached, the
search stops. Intricacies in this search include the following:

1. A "scaling factor", somewhat like a "velocity", is
tracked as the search spreads outward. This scaling
factor multiplies the perceived distance for a single
jump. One net effect of this factor is to reduce the
perceived distance to meanings that are close, thus the
drop-off of distance is not linear as the search expands.
This is a result of an increase in scaling factor based
linearly on the previous jump distance.

2. The scaling factor is also modified by a change in
direction of the search within the lexicon hierarchy. For
example, a jump down to a child from a parent that was
previously jumped up to from another child incurs a
scale factor increase penalty. Similar penalties arise
from jumps down then up, from jumps in "kind of" that
occur after "part of" (and vice versa), and from combinations
of these.

3. Lateral "bond" type connections also incur scale factor
penalties, based on the set distance of the jump.

4. "Pseudosynset" and "instance" meanings are treated in
a special way. When used as the origin, they imply that
the search for related meanings should be within a
smaller radius, as their own greater degree of exactness
implies a more specific kind of search for meanings is
called for. Thus the search does not expand as far; this
is controlled by starting the search with a higher scaling
factor. Additionally, a different measurement of precision
is used, which includes detailed terms that are
otherwise excluded from the standard precision algorithm
initially. (Alternately, if the origin meaning is not
a pseudosynset or instance meaning, then the standard
precision values excluding count of descendant
pseudosynsets are used.)
Block 130

Once distances between meanings within the Lexicon
have been determined, and target nodes within the directory
have been given fixed positions within the semantic space
described by the Lexicon, it is possible to generate scores for
all nodes that fall close to each individual meaning.
Pre-calculating these scores ahead of time allows a much quicker
response time to actual searches. The inputs to this process
are the lexicon, the table of relatives of each meaning,
showing semantic distances, and the data that details what
meanings have been attached to what nodes. The output of
this process is information describing what nodes are close
to each given meaning, and a "score" for that closeness,
which is a direct reflection of the semantic distance from the
origin meaning to the meanings that have been attached to
the node. Other factors that affect the pre-calculated node
scores for a meaning are the number of meanings attached
to the node, and the "commonness" value of the meaning in
question.

An additional element of this pre-calculation step
involves the creation of tables of information that allow very
fast comparison between meanings when determining which
nodes are hit by multiple meanings simultaneously. For this,
bitmapped information that reflects a compressed version of
the node-meaning score information is generated.
Essentially, if a meaning has any score whatsoever for a
particular node, a single bit is set to 1 in a binary field,
marking that node as hit. By comparing the "bitmaps" for
two meanings against each other, a quick assessment of
"combination hits" can be made; it is simply a process of
performing a bitwise AND operation on the two bitmaps.

Because an uncompressed bitmap for every meaning
would be an unmanageably large amount of data, which
would also require a lot of processing time to analyze, the
bitmap data is compressed. There are two levels of
compression, each a level 8 times more condensed than the
last. In the first level of compression, a single bit represents
a set of 8 nodes. If a meaning has a score for any of these
nodes, the bit is set. In the second level of compression, each
bit corresponds to a set of 64 nodes, in the same way.
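The hit bitmaps and their two compression levels can be illustrated with Python integers used as bit sets. This is a sketch only: the function names and the choice of an integer as the storage format are assumptions; the bit-per-node idea, the bitwise AND for combination hits, and the 8-node and 64-node groupings are taken from the description.

```python
def build_bitmap(scored_node_ids):
    """Set bit n for every node n that has any score for a meaning."""
    bitmap = 0
    for node_id in scored_node_ids:
        bitmap |= 1 << node_id
    return bitmap


def compress(bitmap, group=8):
    """Condense a bitmap: output bit i is set if any of the `group`
    input bits in positions i*group .. i*group+group-1 is set."""
    mask = (1 << group) - 1
    out = 0
    index = 0
    while bitmap:
        if bitmap & mask:
            out |= 1 << index
        bitmap >>= group
        index += 1
    return out


def combination_hits(bitmap_a, bitmap_b):
    """Node ids hit by both meanings, found with a single bitwise AND."""
    both = bitmap_a & bitmap_b
    return [i for i in range(both.bit_length()) if (both >> i) & 1]


# Hypothetical node ids scored for two meanings.
piano_player = build_bitmap([3, 70, 200])
australia = build_bitmap([70, 130, 200])

level1 = compress(piano_player)   # one bit per 8 nodes
level2 = compress(level1)         # one bit per 64 nodes
shared = combination_hits(piano_player, australia)   # [70, 200]
```

Applying `compress` twice gives the 64-node level, since each pass condenses by a factor of 8.
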
Storing the bitmap information in this way allows large
chunks of empty raw bitmap data to be ignored, resulting in
a much smaller data set. In addition, the bitwise operations
performed on these bitmaps can be done with greater speed,
because detailed sections of the bitmap data do not have to
be examined unless the higher-order compressed version
indicates that there is information of value present there.
Retrieval
Block 140

While pre-processing and information organization may
be an ongoing process, at any time when at least a partial
table of scores for nodes is available, the user can input
a specific search term as he/she would when using a
conventional search engine.

When users use the search engine to find subjects within
the directory, the search engine conducts two phases of
processing. The first phase, interpretation, involves analyzing
the user's input so that the meanings that the user desires
can be identified (see block 150 description). The second
phase, collection and ranking of results, involves the collecting
of nodes that have good scores for the desired
meanings, and ordering them based on predicted relevance
(see block 160 description).
Block 150: Interpretation Phase

There are two key inputs possible for a search: an input
string, and a set of known meanings. Either, or both, may be
received and processed for results. The input string is simply
a set of words that the user types into a search box and
submits for query. An interpretation of these words is made,
to map words to probable desired meanings. In addition to,
or instead of, the input string, a set of meanings that are
known ahead of time to be of interest may be passed to the
search engine for query. These require less processing, as
plain words do not have to be mapped to meanings, as they
are already predefined. Meanings passed to the engine in this
way are called preconceptions.

The process for interpreting the input string begins with
stemming, or morphing, of the individual words in the
string. This involves mainly an attempt to reduce words
perceived as being plurals to their singular form. This is
necessary because word mappings to meanings within the
Lexicon are stored in singular form only, except in special
cases.

Next, all possible combinations of words (and their morphed
variants) are examined for possible matches to meanings.
Larger numbers of combined words are tested first, to
give preference over individual words (for example, "United
States" must take precedence over analysis of "United" and
"States"). Partial matches with meanings are possible, so
that "Rocky Horror" might bring up a match for "Rocky
Horror Picture Show", for example.

As the set of possible meanings is being compiled, probabilities
are assigned to each. These values reflect the
likelihood that the user really means a certain concept.
Because many words have multiple meanings, probabilities
for implied meanings for words may be manually or automatically
pre-assigned. These values are used in this phase
of the engine processing, in order to estimate what meanings
are most likely implied by particular search words. Other
factors that affect the probabilities given to meanings are:
was the meaning matched by a morphed word or the word
in its "pure" form (favor pure forms); did the meaning only
partially match the input word(s) (if so, reduce
probability); was the meaning the result of a match on
multiple words (if so, increase probability); the commonness
of the meaning implied (favor more common meanings).

Another kind of "concept induction" is applied to the
analysis at this point. All implied meanings are examined
and compared against each other, so that relationships might
be discovered. If there is a connection between two
meanings, those meanings will receive a bonus to their
probability factor, because the implication is that those
particular meanings of the user's words were what the user
wanted. (These comparisons actually occur between all
the meanings that are possibilities for one search word
against all those for each other search word.) Thus if the user
enters "Turkey Poultry", the meaning of "turkey" as a kind
of food will receive a bonus, because a meaning deriving
from "poultry" is connected to this particular
meaning of "turkey". This is extremely valuable in tuning
meaning probabilities, because without this weighting, for
example, the meaning "Turkey, the country" might have
been preferred.

Lastly, a set of simple plain words is compiled, based on
the raw input terms, and given weighting factors based on
whether or not meanings were implied by those terms. These
plain words are used for a more standard word-matching
search that is conducted concurrently with the meaning-based
search. Weighted by a lesser degree than meaning-based
results, hits on subject nodes based on these plain
words do play a factor in the scoring of nodes in the second
phase of the search.

Processing of preconceptions in the Interpretation phase is
simpler, as the meanings that the user desires are passed as
direct input to the engine. The only processing that is
necessary is a weighting on the importance of these
meanings, which is applied by analyzing the commonness of
the meanings in question.

One additional factor remains to be mentioned in the
Interpretation phase. Certain meanings may be considered as
"required" in the search results. Users can specify they want
to require certain meanings to show up in the results by
putting a plus sign (+) before the appropriate word or phrase
in the input string. Additionally, preconceptions may or may
not be sent to the engine as "required" meanings. By default,
the engine performs a logical OR operation on all input
meanings. When meanings are "required", it implies that a
logical AND should be performed on these terms.
Block 160: Collection and Ranking Phase

Once the meanings of interest to the user have been
determined in the Interpretation phase, and given appropriate
weightings based on the likelihood of value, actual nodes
that relate to these meanings can be collected and ordered for
presentation.

This phase begins with the collection of bitmap information
on the desired meanings. Only meanings that have more
than a set number of node scores have bitmap information;
these meanings are called popular. (Exception: if the search
is being performed for a single input word, these bitmaps do
not have to be collected.) Discovering bitmap information
stored for meanings indicates to the engine their popular
state.

Next, the "top scoring nodes" for all meanings are queried
from the pre-calculated score information. Meanings that
were not found to be popular have their bitmaps constructed
during this stage, as all of their scored nodes will be present
in this "top scoring" category. Unless the search is to be a
special AND operation between terms, we begin to compile
a list of potential node results from these score table queries.
(If the search is a pure OR, these results are likely to be
useful. If an AND is to be done, the chances are good that
most of these results will be filtered out, therefore all node
scoring is performed later in this case.)

The bitmap analysis phase comes next. Behavior of the
engine here varies widely between cases where different
kinds of terms are required/not required. In general, the
point of bitmap analysis is for either of the following
reasons, or for a combination of them:

Find out what nodes are hit by more than one meaning,
because they are likely to yield very good scores due to
the combination hit factor. Many of these nodes will not
have shown up in the individual meanings' "top scoring
nodes" lists, and therefore would have been missed had
we not looked specifically for combinations.

Filter out only nodes that show up on required meanings,
because a logical AND between terms is being performed.

Bitmap processing involves a series of bitwise OR and
AND operations, which occur in two passes. Processing is
done first on the highest level of compression, to get a rough
idea of what areas of the bitmaps need to be examined in
more detail. Next, the detailed bitmap information of interest
is queried from the database. Finally, the logical processing
is run again at the uncompressed level of detail.
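The two-pass idea can be sketched for the AND case as follows. The storage layout is an assumption made for illustration: detailed bitmaps are held as a dictionary from 64-node block index to a 64-bit integer, standing in for the database rows that would only be fetched in the second pass. Function names are hypothetical.

```python
BLOCK = 64  # nodes covered by one bit at the highest compression level


def to_blocks(node_ids):
    """Sparse detailed bitmap: block index -> 64-bit integer of node hits."""
    blocks = {}
    for node_id in node_ids:
        index, offset = divmod(node_id, BLOCK)
        blocks[index] = blocks.get(index, 0) | (1 << offset)
    return blocks


def coarse_bitmap(blocks):
    """Highest-compression bitmap: bit i is set if block i holds any hit."""
    bits = 0
    for index in blocks:
        bits |= 1 << index
    return bits


def two_pass_and(blocks_a, blocks_b):
    """Pass 1 ANDs the coarse bitmaps to find candidate blocks;
    pass 2 ANDs the detailed data for those blocks only."""
    candidates = coarse_bitmap(blocks_a) & coarse_bitmap(blocks_b)
    hits = []
    index = 0
    while candidates:
        if candidates & 1:
            # Only now is the detailed block data touched.
            detail = blocks_a[index] & blocks_b[index]
            hits.extend(index * BLOCK + bit
                        for bit in range(BLOCK) if (detail >> bit) & 1)
        candidates >>= 1
        index += 1
    return hits


meaning_a = to_blocks([3, 70, 200])
meaning_b = to_blocks([5, 70, 200, 900])
common_nodes = two_pass_and(meaning_a, meaning_b)   # [70, 200]
```

Block 0 passes the coarse test for both meanings (nodes 3 and 5) but yields no hit at the detailed level, which shows why the second pass is still needed.
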

The result of the bitmap processing is a set of nodes of
interest. These nodes are queried from the score tables for all
meanings of interest and added to the list of potential result
nodes.

The next stage is actual scoring of node results. Node
scores result primarily from a multiplication of the score for
the node for a given meaning and the probability factor for
that meaning that came out of the Interpretation phase.
Additional bonuses are given to nodes that are hit by a
combination of meanings. The "portion of meaning" of the
node is also considered. For example, if a node has three
attached meaning groups (synset groups), and two of those
meanings were queried by the user, we can say roughly that
2/3 of the concept behind this node was of interest to the user.
Thus this node is probably of less interest to the user than
one whose concept was hit more completely. Other special
considerations are also introduced, such as the favoring of
nodes that are hit "perfectly" by meanings, over those that
were separated at all by some distance in semantic space.
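A rough sketch of this scoring step follows. The description does not say how per-meaning products are combined or how large the bonuses are, so the summation, the `combination_bonus` value and the function name are assumptions; the sketch also assumes each queried meaning hits a distinct synset group on the node.

```python
def score_node(node_scores, probabilities, synset_groups_on_node,
               combination_bonus=1.25):
    """Combine pre-calculated scores with interpretation probabilities.

    node_scores: meaning -> pre-calculated score of this node for it
    probabilities: queried meaning -> probability factor from interpretation
    synset_groups_on_node: number of synset groups attached to the node
    combination_bonus: assumed multiplier when several meanings hit the node
    """
    hit = [m for m in probabilities if m in node_scores]
    if not hit:
        return 0.0
    # Primary term: node score for the meaning times its probability factor.
    base = sum(node_scores[m] * probabilities[m] for m in hit)
    if len(hit) > 1:
        base *= combination_bonus
    # "Portion of meaning": fraction of the node's groups that were queried.
    portion = min(len(hit), synset_groups_on_node) / synset_groups_on_node
    return base * portion


# A node with three synset groups, two of which the user asked about.
score = score_node(
    node_scores={"piano player": 0.8, "Australia": 0.6},
    probabilities={"piano player": 0.9, "Australia": 1.0},
    synset_groups_on_node=3,
)
# score is approximately 1.1: (0.72 + 0.60) * 1.25 * 2/3
```

The favoring of "perfect" hits over hits at some semantic distance is left out here; it could be added as a further multiplier.
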

It should also be mentioned here that some additional
tricky processing comes into play when dealing with the
simple plain word hits, whose processing is concurrent to the
meaning-based search processing that this paper focuses on.
Special processing for plain words is performed that
involves searching for matches of different plain word
search terms on the same sentences on the subject node
pages that are to be pulled up as results. Additionally,
different weighting comes into play based on where plain
words appear; for example, a word showing up in the title
of a node is valued more than the appearance of that word
later on the page.

After scoring, all that is left in anticipation of results
presentation is a sorting of the node results by score, the
selecting out of those nodes whose scores merit
presentation, and the classification of these nodes into
groups of "strong", "medium", and "weak" score values.
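
The sort, cut-off and grouping described here might look like the sketch below. The threshold values and the function name `present` are hypothetical; the text does not state what score "merits presentation" or where the group boundaries lie.

```python
def present(scores, cutoff=0.2, strong=0.7, medium=0.4):
    """Sort nodes by score, drop those below the cutoff, and group the rest."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    groups = {"strong": [], "medium": [], "weak": []}
    for node, score in ranked:
        if score < cutoff:
            continue  # score does not merit presentation
        if score >= strong:
            groups["strong"].append(node)
        elif score >= medium:
            groups["medium"].append(node)
        else:
            groups["weak"].append(node)
    return groups


print(present({"a": 0.9, "b": 0.5, "c": 0.25, "d": 0.1, "e": 0.75}))
# {'strong': ['a', 'e'], 'medium': ['b'], 'weak': ['c']}
```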
Block 170 

As mentioned above, users interact with the search engine
directly by entering an input string into a simple input box
on a Web page. In a standard initial query, whose span is the
entire directory, this string is passed to the engine for
interpretation without any preconceptions as parameters.
Results spanning all subjects in the system are retrieved and
presented to the user so they may subsequently navigate to
those areas of the directory.

Once the user is on a particular subject page of the
directory, however, they also have the option to perform a
narrow search. Essentially, this is a search for subject matter
that is "close in meaning" to the subject page they are
currently on. These kinds of searches are performed by the
passing of the meanings attached to the given subject to the
search engine as preconceptions, along with the user's input
string. In this case, the preconceptions are designated as
required, and the input string terms, as a whole, are also
treated as required terms.

Whenever a user performs a search, the node results are 
preceded by a description of what meanings were actually
searched on. Because these meanings are often the result of 
the interpretation of the user's input string, and may include 
multiple possible meanings for given words, the user is 
encouraged here to specify exactly what meanings they 
wanted, and to search again. This second search is called a 
refined search. Essentially, the user is presented with a set of 
checkboxes, each of which corresponds to a possible 
intended meaning. The user refines the search by simply 
checking off the meanings she wishes to search on, and 
clicking on "Search Again". 

A refined search is passed to the search engine as a set of
nonrequired preconceptions; thus an OR operation is performed
on all of the meanings. Increased functionality in this
part of the system, including the option to set specific 
meanings as "required", as well as the ability to include 
plain word searching in refined searches, is planned for the 
near future. 
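
The difference between a narrow search (required preconceptions, AND) and a refined search (nonrequired preconceptions, OR) can be illustrated with plain set operations. This sketch is a simplification: it matches meanings exactly rather than by closeness in semantic space, it reads "as a whole" to mean that at least one interpretation of the input string must match, and the names `narrow_search`, `refined_search` and the sample directory are hypothetical.

```python
def narrow_search(node_meanings, preconceptions, input_meanings):
    """Required preconceptions (AND) plus the input string required as a whole."""
    return {
        node
        for node, meanings in node_meanings.items()
        if preconceptions <= meanings and meanings & input_meanings
    }


def refined_search(node_meanings, selected):
    """Nonrequired preconceptions: OR over the meanings the user checked off."""
    return {node for node, meanings in node_meanings.items() if meanings & selected}


directory = {
    "Skiing in California": {"skiing", "california"},
    "Skiing in Colorado": {"skiing", "colorado"},
    "Surfing in California": {"surfing", "california"},
}
print(narrow_search(directory, {"skiing"}, {"california"}))
# {'Skiing in California'}
print(sorted(refined_search(directory, {"california", "colorado"})))
# ['Skiing in California', 'Skiing in Colorado', 'Surfing in California']
```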
Block 180 

Node results, i.e. target documents, having the desired
meaning or closeness in meaning to the concept searched for
may then be shown to the user. The user in turn may navigate
the nodes presented to him or refine his search in accordance
with block 170 described above. Once nodes are collected
and ranked in accordance with block 160 (also described
above), they may be shown to the user as results with the
highest ranked nodes appearing first. This concludes the
retrieval stage where the user has now successfully navigated
to the site of interest. As compared with conventional
Web search engines and directories, searching in a
meaning-based fashion as described above allows users to more
quickly locate relevant and useful documents or sites.

FIG. 2 illustrates a portion of a relationship based lexicon 
employed in one or more embodiments of the invention. 

FIG. 2 shows one portion of a sample lexicon 200 which 
differs from conventional lexicons by adding a lateral bond 
("bind") connection between elements. The boxed elements
in FIG. 2 represent meanings within the lexicon and,
collectively with the relationship connections
between meanings, can be viewed as defining a semantic
space. The three basic relationship types "part of", "kind of"
and "bind" are represented by differing line types in FIG. 2, 
a legend for which is drawn thereon. 

The relationships between elements may take on many
forms and can become quite complex, but for ease of
illustration, a simple example dealing with skiing is shown in FIG. 2.

Starting with the branch for "sport", "skiing" is defined in
the lexicon 200 as a kind of "sport". The word "ski",
typically in its noun form, can be thought of as related to
"skiing" in that it is a part of "skiing" as shown in FIG. 2.
"Slalom skiing" is a type of skiing and hence a "kind of"
connection is shown between it and "skiing". "Bindings" are
a structural attachment on a ski, and hence it is assigned a
"part of" connection with "ski". The example of a specific
brand of ski, "K2 ski," is given to show how it is in a "kind
of" connection with "ski".
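
One way to hold such a lexicon in memory is a small labeled graph, sketched below with the skiing relationships just described. The class and method names (`Relation`, `Lexicon`, `relate`, `ancestors`) are hypothetical and the text does not prescribe any particular data structure; lateral "bind" edges would be added through the same `relate` call.

```python
from collections import defaultdict
from enum import Enum


class Relation(Enum):
    KIND_OF = "kind of"
    PART_OF = "part of"
    BIND = "bind"  # lateral bond between meanings in different branches


class Lexicon:
    def __init__(self):
        # meaning -> list of (relation, related meaning)
        self.edges = defaultdict(list)

    def relate(self, source, relation, target):
        self.edges[source].append((relation, target))

    def ancestors(self, meaning):
        """Meanings reached by following "kind of" / "part of" edges upward."""
        found, stack = [], [meaning]
        while stack:
            for relation, target in self.edges[stack.pop()]:
                if relation is not Relation.BIND and target not in found:
                    found.append(target)
                    stack.append(target)
        return found


lexicon = Lexicon()
lexicon.relate("skiing", Relation.KIND_OF, "sport")
lexicon.relate("ski", Relation.PART_OF, "skiing")
lexicon.relate("slalom skiing", Relation.KIND_OF, "skiing")
lexicon.relate("bindings", Relation.PART_OF, "ski")
lexicon.relate("K2 ski", Relation.KIND_OF, "ski")
print(lexicon.ancestors("bindings"))  # ['ski', 'skiing', 'sport']
```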

Unique to the lexicon developed for the invention, "K2
ski" is also assigned a lateral bond showing a conceptual
commonness with the manufacturer of the ski, "K2", which
lies in the "company" branch. The company branch has as a
child "athletic equipment company" as a kind of "company."
"Athletic equipment company" has as its child in turn the
"K2" company.

Considering "ski" once again, "ski" is also a child of the
"equipment" branch which has "athletic equipment" as a
kind of "equipment" and ski as a kind of "athletic
equipment". "Surfboard" is related to "ski" in that it too is a kind
of "athletic equipment". Target documents or nodes within
a subject directory may be "placed" or "located" by human
intervention into the semantic space as defined by lexicon
200. A website that sells skis or has information about skiing
destinations would fall somewhere within the defined
semantic space based upon its focus of content.

FIG. 3 illustrates the concept of bond strength and semantic
distance in one or more embodiments of the invention.

Using the same exemplary lexicon 200 of FIG. 2, FIG. 3
illustrates how distance and closeness of meaning between
meanings can be quantified within the semantic space.
Distances are shown between the element "ski" and all other
elements within the semantic space. Using three classes of
bond strengths the degree of closeness between meanings
may be discovered. A "strong relationship" exists between
"ski" and "skiing" as does between "ski" and "athletic
equipment." Between "skiing" and "sport" there is a weaker
than strong relationship known as a "medium relationship".
This is because when you think of the root term "skiing" one
doesn't quickly think also of "sport". Going from "ski" to
"skiing" however, the average person would more likely
associate or think "skiing" if given the term "ski". The
direction in the arrows in the bond strengths indicates the
direction of association. "A→B" in FIG. 3 means that if you
are given A, how likely is it or closely would one associate
the meaning B. Going the other direction between the same
two elements may produce a different bond strength. A
"weak relationship" would be displayed between "ski" and
"K2 ski" (when you think of "ski," "K2 ski" doesn't closely
come to mind). However, if one were to go from "K2 ski"
to "ski" this might be construed as a strong relationship since
one would naturally associate "ski" if given "K2 ski".

FIG. 3 also shows semantic distances between elements.
"Ski" and "skiing" have only a distance of 2 between them
while "skiing" and "sport" have a distance of 5 (7-2). The
distance between "ski" and "sport" is 7. When travelling
from parent to child or vice-versa, the distances can be
simply added/subtracted but when changing the direction of
travel, a penalty may be imposed upon the distance
calculation. Take for example the distance between "ski" and
"athletic equipment company". Judging merely on a linear
basis, the distance might be 12. But since the path from "ski"
to "athletic equipment" switches direction twice (it starts
down to "K2 ski" and then across the lateral bond to "K2"
and then up to "athletic equipment company") a penalty or
scaling factor would cause the distance between "ski" and
"athletic equipment" to be much larger than just 12
especially given their lack of connectedness. As described above
penalties may be added when the direction of traversal is
switched or when a lateral bond is crossed. Meaning-by-meaning,
distances between elements may be calculated and
stored for future use in search retrieval.

FIG. 4 illustrates the application of synsets to categories
in a subject directory tree.

Given a lexicon 400 (similar to lexicon 200) and a subject
directory 410, FIG. 4 illustrates how "Skiing in California"
may be assigned a location in the semantic space. In the
subject directory 410, the subject Sports has associated with
it the child subjects of "Football", "Skiing" "Baseball" and
"Tennis." The subject "Skiing" has its own children "Skiing
in California", "Skiing in Colorado" "Skiing in Alaska" and
"Skiing in Washington". In the semantic space "Skiing in
California" would be assigned to "skiing" as well as to
"California". The node "Skiing in California" would thus be
ascribed the element "skiing" as well as "California" such
that both meanings and related meanings would be available
as additional refinements. For instance, consider "Los
Angeles" in California. If another node of directory
410 described "Things to do in Los Angeles", by virtue of
the connectedness of Los Angeles as part of California
within the semantic space, this node may also be presented
to the user when a search for "California" or "Skiing in
California" is performed even though none of the literal
sub-strings match with the node. The meanings or closeness
in concept would bring such relevancy of nodes to the
forefront.

FIG. 5 is a system diagram of one embodiment of the
invention. Illustrated is a computer system 510, which may
be any general or special purpose computing or data
processing machine such as a PC (personal computer), coupled
to a network 500.

One of ordinary skill in the art may program computer
system 510 to act as a meaning-based search engine. This
may be achieved using a processor 512 such as the
Pentium® processor (a product of Intel Corporation) and a
memory 511, such as RAM, which is used to store/load
instructions, addresses and result data as needed. The
application(s) used to perform the functions of a
meaning-based information organization and retrieval system may
derive from an executable compiled from source code
written in a language such as C++. The instructions of that
executable file, which correspond with instructions
necessary to scale the image, may be stored to a disk 518, such as
a floppy drive, hard drive or CD-ROM 517, or memory 511.
The lexicon, directory and scoring and distance tables and
other such information may be written to/accessed from disk
518 or similar device. The software may be loaded into
memory 511 and its instructions executed by processor 512.

Computer system 510 has a system bus 513 which
facilitates information transfer to/from the processor 512 and
memory 511 and a bridge 514 which couples to an I/O bus
515. I/O bus 515 connects various I/O devices such as a
network interface card (NIC) 516, disk 518 and CD-ROM
517 to the system memory 511 and processor 512. The NIC
516 allows the meaning-based search engine software
executing within computer system 510 to transact data, such
as queries from a user, results of such queries back to users
that present meaning-based results and refinements to
searches performed, with users connected to network 500.
Filters and other meaning-based search utilities may be
distributed across network 500.

Many such combinations of I/O devices, buses and
bridges can be utilized with the invention and the
combination shown is merely illustrative of one such possible
combination.

The exemplary embodiments described herein are
provided merely to illustrate the principles of the invention and
should not be construed as limiting the scope of the
invention. Rather, the principles of the invention may be applied
to a wide range of systems to achieve the advantages
described herein and to achieve other advantages or to
satisfy other objectives as well.

We claim:

1. A method comprising:

organizing concepts according to their meaning into a
lexicon, said lexicon defining elements of a semantic
space;




specifying relationships between concepts; and

determining a semantic distance from a first concept to a
second concept, said semantic distance representing
closeness in meaning between said first concept and
said second concept, wherein said semantic distance is
calculated by evaluating steps along a semantic path
between said first concept and said second concept and
applying a dynamic scaling factor to a perceived
distance of each step along the semantic path according to
types of relationships followed, directionality of the
relationships and changes in direction along the
semantic path, and number of competing relationships
followed at each step.

2. A method according to claim 1 further comprising
determining new relationships between concepts in said
lexicon by determining said semantic distance between
concepts, defining a radius of semantic distance about a
given concept and inferring a relationship between said
concepts, excluding concepts falling in distances beyond
said radius.

3. A method according to claim 1 wherein said elements
are related by a connection, said connections including a
lateral bind, a kind of and a part of.

4. A method according to claim 3 wherein said connection
has an associated strength representing the degree to which
said elements are related.

5. A method according to claim 4 wherein said strength
from a first element to a second element may be different
from the strength from said second element to said first
element.

6. A method of searching a data set comprising:

organizing concepts according to their meaning into a
lexicon, said lexicon defining elements of a semantic
space;

providing a first meaning differentiator in response to an
input query, wherein said first meaning differentiator is
a set of concepts from said lexicon that represent a first
location of said query in the semantic space;

providing a second meaning differentiator for each
element of a target data set, wherein said second meaning
differentiator is a set of concepts from said lexicon that
represent a second location of said target data element
in the semantic space;

determining a semantic distance from the first meaning
differentiator to the second meaning differentiator,
wherein the semantic distance is calculated by
evaluating steps along a semantic path between the first
meaning differentiator and the second meaning
differentiator and applying a dynamic scaling factor to a
perceived distance of each step along the semantic path
according to the types of relationships followed,
directionality of the relationships and changes in direction
along the semantic path, and number of competing
relationships followed at each step, and

presenting results of a search conducted on the target data
set for target data elements close in meaning to an input
query, wherein the closeness in meaning is determined
by the semantic distance between the first meaning
differentiator for said input query and the second
meaning differentiator for each target data element.

7. A method according to claim 6 wherein said search is
conducted by ranking elements of said target data set
according to conceptual relevance.

8. A method according to claim 6 wherein said concepts
may be marked as at least one of a geographical location,
offensive, unique instance, and timely, and wherein the
markings can be used to filter elements from target data
set so that target data elements said markings can be
prevented from being presented as search results.

9. A method according to claim 6 further comprising:

enabling a user to select at least one meaning from the set
of possible meanings for the input query to provide the
correct interpretation of the input query for use as input
to the search.

10. A method according to claim 6 wherein said target
data set includes documents.

11. A method according to claim 10 wherein said
documents include documents accessible via the world-wide
web.

12. A method according to claim 10 wherein the meaning
differentiators for the documents are determined in an
interpretation phase by mapping each word in the document to
probable desired meanings.

13. A method according to claim 12 wherein the
interpretation phase uses the relationships between concepts
defined by the lexicon to increase the likelihood of meanings
of each word which have relationships to meanings of other
words in the document.

14. A method according to claim 6 wherein the target data
set includes subjects in a directory.

15. A method according to claim 6 wherein the input
query may be a text string consisting of words.

16. A method according to claim 15 wherein the meaning
differentiator is determined in an interpretation phase by
mapping each word in an input string to probable desired
meanings.

17. A method according to claim 16 wherein said
interpretation phase uses the relationships between concepts
defined by the lexicon to increase the likelihood of meanings
of each word which have relationships to meanings of other
words in the input.

18. A method according to claim 6 wherein the input
query may be a set of predetermined concepts.

19. A method according to claim 6 wherein the target data
element may be a set of predetermined concepts.

20. A method according to claim 6 wherein the concepts
are given a commonness value; and wherein the search is
conducted to improve the ranking of elements of said target
data set according to commonness of the concepts.

21. A method according to claim 6 wherein the meaning
differentiator is an intersection or a union of concepts from
the lexicon.

22. A method according to claim 6 wherein the target data
elements are preindexed according to the concepts in their
meaning differentiators to improve the speed of the search.

23. A method according to claim 6 wherein a user can
initiate a secondary search for documents which are close in
meaning to at least one of the search results.

24. An information handling system comprising:

means for organizing concepts according to their meaning
into a lexicon that defines elements of a semantic space;






means for providing a first meaning differentiator in 
response to an input query, wherein the first meaning 
differentiator is a set of concepts from the lexicon 
representing a first location in the semantic space; 

means for providing a second meaning differentiator for 
each element of a target data set, wherein the second 
meaning differentiator is a set of concepts from the 
lexicon representing a second location in the semantic 
space; and

means for determining a semantic distance from the first
location in the semantic space to the second location in
the semantic space, wherein the semantic distance
represents closeness in meaning between the first loca- 
tion in the semantic space and the second location in the 
semantic space, wherein search results are presented 
for target data elements close in meaning to the input 
query and the closeness in meaning is determined by 
the semantic distance between the first meaning differ- 
entiator for said input query and the second meaning 
differentiator for each target data element. 
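
As an illustration only, the semantic-distance determination recited in the claims (summing the steps along a semantic path and scaling each step by relationship type and changes in direction) can be sketched as follows. The scaling constants, the per-step distances on the "K2" path, and the names `Step` and `path_distance` are assumptions; only the distances 2, 5 and 7 for "ski", "skiing" and "sport" and the linear total of 12 come from the description of FIG. 3. Bond strength and the number of competing relationships are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Step:
    target: str
    base: float     # perceived distance of this step before scaling
    direction: str  # "up" (child to parent), "down" (parent to child) or "lateral"


DIRECTION_CHANGE_FACTOR = 2.0  # assumed penalty when the direction of travel switches
LATERAL_FACTOR = 1.5           # assumed penalty for crossing a lateral bond


def path_distance(steps):
    """Sum the scaled distance of each step along a semantic path."""
    total = 0.0
    previous = None
    for step in steps:
        factor = 1.0
        if step.direction == "lateral":
            factor *= LATERAL_FACTOR
        if previous is not None and step.direction != previous:
            factor *= DIRECTION_CHANGE_FACTOR
        total += step.base * factor
        previous = step.direction
    return total


# "ski" up to "skiing" up to "sport": no change of direction, distances simply add.
print(path_distance([Step("skiing", 2, "up"), Step("sport", 5, "up")]))  # 7.0

# "ski" down to "K2 ski", across to "K2", up to "athletic equipment company".
# The three step distances of 4 are hypothetical values chosen to sum to 12.
detour = [
    Step("K2 ski", 4, "down"),
    Step("K2", 4, "lateral"),
    Step("athletic equipment company", 4, "up"),
]
print(path_distance(detour))  # 24.0, well above the linear 12
```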